Conversation


@wj-laskowski wj-laskowski commented Jan 16, 2026

Proposed changes

Added the fused bias + bnorm + clamp operation for the WMMA grouped conv fwd large tensor kernel (FP16/BF16 data types and NHWGCxGKYXC layout).
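
For context, the fused epilogue applies the inference-time batch norm (folded into a per-channel scale and shift) to the biased convolution output and then clamps the result. Below is a minimal sketch of the semantics only; the functor and parameter names are illustrative, not the actual CK element-wise operator:

```cpp
#include <algorithm>

// Sketch of the fused bias + bnorm + clamp epilogue per output element.
// gamma/beta are the batch-norm scale/shift folded for inference
// (gamma = weight / sqrt(var + eps), beta = bias_bn - mean * gamma);
// lo/hi are the clamp bounds. Illustrative only.
struct BiasBnormClampSketch
{
    float lo = 0.f;
    float hi = 6.f;

    float operator()(float conv_out, float bias, float gamma, float beta) const
    {
        // e = clamp(gamma * (conv_out + bias) + beta, lo, hi)
        return std::clamp(gamma * (conv_out + bias) + beta, lo, hi);
    }
};
```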

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which helps the maintainers understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered.

@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-bias_bnorm_clamp branch from 7b0341d to b5c541f Compare January 19, 2026 09:35
@wj-laskowski wj-laskowski changed the title from "Streamhpc/grouped conv fwd wmma large tensor bias bnorm clamp" to "WMMA grouped conv fwd large tensor bias bnorm clamp" Jan 19, 2026
@wj-laskowski wj-laskowski marked this pull request as ready for review January 19, 2026 12:58
@krithalith
Contributor

Looks good overall! I have some comments:

  1. They wanted only a single generic instance per instance list for (Large Tensor) bias bnorm clamp. You can just make two new instance lists that each contain only one generic instance and use them only for bias bnorm clamp. We did the same for the non-large version. Please first find an actual generic instance, because it seems like the current instance lists do not contain one. You can probably adapt one from the XDL Large Tensor instance lists. A generic instance should have all scalarPerVector values equal to 1 (see the sketch after this list).
  2. Are the current bias bnorm clamp tests sufficient for testing the Large implementation? I.e., are the tensors large enough to actually cause splitting? If not, it might be useful to add a "Large Cases" test like for the other flavors.
  3. Did you check if the Large Tensor implementation is actually run and can support all the test cases (especially check after reducing instance lists to generic only)?
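
For point 1, a short illustration of why scalarPerVector == 1 makes an instance generic: a vectorized transfer of width V requires the contiguous dimension length to be divisible by V, so V == 1 is the only width that never rejects a shape. The function below is hypothetical and only mirrors the kind of divisibility test a CK device instance's IsSupportedArgument performs:

```cpp
#include <cstdio>

// Hypothetical stand-in for the divisibility check a device instance's
// IsSupportedArgument applies to its vector-width template parameters.
bool supports_vector_width(int contiguous_length, int scalar_per_vector)
{
    return contiguous_length % scalar_per_vector == 0;
}

int main()
{
    // A channel count of 3 (e.g. an RGB input) rules out every width but 1,
    // which is why a generic fallback instance uses scalarPerVector == 1.
    for(int v : {1, 2, 4, 8})
        std::printf("C=3, V=%d -> %s\n", v, supports_vector_width(3, v) ? "ok" : "unsupported");
    return 0;
}
```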

@ApoorvaKalyani ApoorvaKalyani force-pushed the streamhpc/conv_fwd_wmma_bias_bnorm_clamp branch from 23c6688 to 7306d8b Compare January 21, 2026 10:51
krithalith previously approved these changes Jan 21, 2026
bartekxk previously approved these changes Jan 21, 2026
@bartekxk bartekxk (Contributor) left a comment

lgtm please rebase

Copilot AI left a comment

Pull request overview

This pull request adds WMMA (Wave Matrix Multiply Accumulate) kernel support for grouped convolution forward operations with bias, batch normalization, and clamp activation for large tensors. The implementation targets FP16 and BF16 data types with NHWGC/GKYXC/NHWGK tensor layouts.

Changes:

  • Added WMMA kernel instances for 2D and 3D grouped convolutions with bias+bnorm+clamp operations
  • Updated test infrastructure to enable bias+bnorm+clamp tests on gfx9, gfx11, and gfx12 GPU targets
  • Modified initialization ranges in the profiler so the fused operation stays monotone, improving numerical stability on RDNA3 architectures (see the sketch after this list)
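
On the last point, one plausible reading (an assumption on our part, not confirmed by the diff) is that the ranges were narrowed so that, for example, the batch-norm scale is always positive, keeping the fused epilogue monotone in the convolution output; FP16/BF16 rounding then cannot flip values across the clamp bounds when comparing device results against the host reference. A hedged sketch with illustrative range values:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fill a buffer with uniform values in [lo, hi]. The ranges used below are
// illustrative only, not the ones chosen in this pull request.
std::vector<float> fill_uniform(std::size_t n, float lo, float hi, unsigned seed = 42)
{
    std::mt19937 gen{seed};
    std::uniform_real_distribution<float> dist{lo, hi};
    std::vector<float> v(n);
    for(auto& x : v)
        x = dist(gen);
    return v;
}

int main()
{
    // A strictly positive scale keeps gamma * x + beta non-decreasing in x;
    // a range like [-1, 1] would let the sign of gamma reverse the ordering.
    auto gamma = fill_uniform(64, 0.5f, 1.5f);
    auto beta  = fill_uniform(64, -0.5f, 0.5f);
    return gamma.empty() || beta.empty();
}
```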

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/grouped_convnd_fwd_activation/test_grouped_convnd_fwd_gk_bias_bnorm_clamp.cpp | Updated function call from bias_clamp to bias_bnorm_clamp |
| test/grouped_convnd_fwd_activation/test_grouped_convnd_fwd_bias_bnorm_clamp.cpp | Updated function call from bias_clamp to bias_bnorm_clamp |
| test/grouped_convnd_fwd_activation/CMakeLists.txt | Moved bias_bnorm_clamp tests to the gfx9/11/12 target section |
| test/CMakeLists.txt | Added the test to the regression test list |
| profiler/src/profile_grouped_conv_fwd_bias_bnorm_clamp.cpp | New profiler entry point for the bias_bnorm_clamp operation |
| profiler/src/CMakeLists.txt | Added the profiler source and device instances to the build |
| profiler/include/profiler/profile_grouped_conv_fwd_bias_bnorm_clamp_impl.hpp | Changed initialization ranges to monotone values for numerical stability |
| library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_bnorm_clamp/wmma/large_tensor/*.cpp | WMMA large tensor instances for 3D conv (f16/bf16) |
| library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_bnorm_clamp/wmma/*.cpp | WMMA instances for 3D conv (f16/bf16) |
| library/src/tensor_operation_instance/gpu/grouped_conv3d_fwd_bias_bnorm_clamp/CMakeLists.txt | Updated build configuration to include the WMMA instances |
| library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_bnorm_clamp/wmma/large_tensor/*.cpp | WMMA large tensor instances for 2D conv (f16/bf16) |
| library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_bnorm_clamp/wmma/*.cpp | WMMA instances for 2D conv (f16/bf16) |
| library/src/tensor_operation_instance/gpu/grouped_conv2d_fwd_bias_bnorm_clamp/CMakeLists.txt | Updated build configuration to include the WMMA instances |
| library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_bias_bnorm_clamp_wmma_cshufflev3.inc | Forward declarations for the WMMA instance functions |
| library/include/ck/library/tensor_operation_instance/gpu/grouped_convolution_forward_bias_bnorm_clamp.hpp | Factory integration for the WMMA instances |
| library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_cshufflev3_large_tensor_instance.hpp | Generic instance templates for WMMA large tensor |
| library/include/ck/library/tensor_operation_instance/gpu/grouped_conv_fwd/device_grouped_conv_fwd_wmma_cshufflev3_instance.hpp | Generic instance templates and type aliases |
| include/ck/utility/array.hpp | Added an Emplace method for in-place construction (sketched below) |
| include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_d_wmma_cshuffle_v3_large_tensor.hpp | Refactored to use Emplace and added NumDTensor guards |
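
The Array change called out above (include/ck/utility/array.hpp) adds in-place construction; below is a minimal sketch of what such an Emplace method can look like on a fixed-capacity array. The type is illustrative, not the actual ck::Array implementation:

```cpp
#include <cstddef>
#include <new>
#include <utility>

// Illustrative fixed-capacity array with an Emplace method; not the real
// ck::Array. Emplace forwards constructor arguments and builds the element
// directly in the array's storage, avoiding a copy of a temporary.
template <typename T, std::size_t N>
struct FixedArraySketch
{
    alignas(T) std::byte storage_[N * sizeof(T)];
    std::size_t size_ = 0;

    template <typename... Args>
    T& Emplace(Args&&... args)
    {
        T* p = ::new(static_cast<void*>(storage_ + size_ * sizeof(T)))
            T(std::forward<Args>(args)...);
        ++size_;
        return *p;
    }

    T& operator[](std::size_t i)
    {
        return *std::launder(reinterpret_cast<T*>(storage_ + i * sizeof(T)));
    }
};
```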


@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-bias_bnorm_clamp branch from 35f342c to 427b259 Compare January 21, 2026 12:13
Base automatically changed from streamhpc/conv_fwd_wmma_bias_bnorm_clamp to develop January 22, 2026 08:54
@ApoorvaKalyani ApoorvaKalyani dismissed stale reviews from bartekxk and krithalith January 22, 2026 08:54

The base branch was changed.

The following operations are added for the FP16/BF16 data types and NHWGCxGKYXC layout:
- grouped_conv2d_fwd_bias_bnorm_clamp
- grouped_conv3d_fwd_bias_bnorm_clamp
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-bias_bnorm_clamp branch from 427b259 to 41ca771 Compare January 22, 2026 09:11
@wj-laskowski wj-laskowski requested a review from bartekxk January 23, 2026 07:26
@wj-laskowski wj-laskowski merged commit 2e08a7e into develop Jan 23, 2026
22 checks passed
@wj-laskowski wj-laskowski deleted the streamhpc/grouped-conv-fwd-wmma-large-tensor-bias_bnorm_clamp branch January 23, 2026 11:20